Add FileScanConfigBuilder #15352

Merged on Mar 28, 2025 (17 commits)

Contributor

@blaginin blaginin commented Mar 21, 2025

Related to #14685 (comment)

Rationale for this change

FileScanConfig currently violates the single-responsibility principle from SOLID. It serves two conflicting roles:

  • As a builder, though this should change, as noted in this TODO:

    // TODO: This function should be moved into DataSourceExec once FileScanConfig moved out of datafusion/core

  • As a business logic provider (e.g., fn project, impl DataSource, etc.)

These conflicting roles lead to issues like #14905 and #14679, where provider features are accessed even before the build process is complete.
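
The separation argued for above can be sketched with toy types. `ScanConfig`, `ScanConfigBuilder`, and `example_scan` below are hypothetical stand-ins, not DataFusion's actual definitions. The sketch only illustrates the idea that setters live on the builder, while provider logic (here `projected_columns`) is reachable only after `build()`.

```rust
/// Hypothetical finished configuration; provider logic lives only here.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanConfig {
    columns: Vec<String>,
    projection: Option<Vec<usize>>,
    limit: Option<usize>,
}

impl ScanConfig {
    /// Provider-side logic: names of the columns that survive projection.
    /// Panics if a projection index is out of range (kept simple for the sketch).
    pub fn projected_columns(&self) -> Vec<String> {
        match &self.projection {
            Some(indices) => indices.iter().map(|&i| self.columns[i].clone()).collect(),
            None => self.columns.clone(),
        }
    }

    /// Row limit carried over from the builder.
    pub fn limit(&self) -> Option<usize> {
        self.limit
    }
}

/// Hypothetical builder: holds partial state and exposes setters only.
#[derive(Debug, Clone)]
pub struct ScanConfigBuilder {
    columns: Vec<String>,
    projection: Option<Vec<usize>>,
    limit: Option<usize>,
}

impl ScanConfigBuilder {
    pub fn new(columns: Vec<String>) -> Self {
        Self { columns, projection: None, limit: None }
    }

    pub fn with_projection(mut self, projection: Option<Vec<usize>>) -> Self {
        self.projection = projection;
        self
    }

    pub fn with_limit(mut self, limit: Option<usize>) -> Self {
        self.limit = limit;
        self
    }

    /// Consumes the builder; provider methods become available only now.
    pub fn build(self) -> ScanConfig {
        ScanConfig {
            columns: self.columns,
            projection: self.projection,
            limit: self.limit,
        }
    }
}

/// Example call chain, mirroring the shape of the builder usage in this PR.
pub fn example_scan() -> ScanConfig {
    let columns = vec!["a".to_string(), "b".to_string(), "c".to_string()];
    ScanConfigBuilder::new(columns)
        .with_projection(Some(vec![2, 0]))
        .with_limit(Some(10))
        .build()
}
```

Because the builder type has no `projected_columns` method, calling provider logic on a half-built configuration is a compile error rather than a runtime surprise.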

What changes are included in this PR?

I've added FileScanConfigBuilder and deprecated the builder approach on FileScanConfig.

Are these changes tested?

Yes, I updated the existing tests to use the new interface.

Are there any user-facing changes?

Yes, there is a new builder interface, but the switch is quite easy (all builder methods from FileScanConfig are supported).

@github-actions github-actions bot added core Core DataFusion crate proto Related to proto crate datasource Changes to the datasource crate labels Mar 21, 2025
Comment on lines 247 to 250
- FileScanConfig::new(object_store_url, self.schema(), source)
+ FileScanConfigBuilder::new(object_store_url, self.schema(), source)
      .with_projection(projection.cloned())
-     .with_limit(limit);
+     .with_limit(limit)
+     .build();
Contributor Author

@mertak-synnada, hey, this PR is still WIP but I was wondering if you're happy with this approach. That's what we've discussed in #14685 (comment)

Contributor

Looking good so far, thank you! I haven't been able to test it thoroughly yet, but the legacy ParquetExecBuilder could be helpful for understanding specific cases, just FYI.

@github-actions github-actions bot added the substrait Changes to the substrait crate label Mar 24, 2025
@blaginin blaginin changed the title WIP: Add FileScanConfigBuilder and switch some cases WIP: Add FileScanConfigBuilder Mar 24, 2025
@blaginin blaginin marked this pull request as ready for review March 24, 2025 21:22
@blaginin
Contributor Author

also fyi @AdamGS 👀

@blaginin blaginin changed the title WIP: Add FileScanConfigBuilder Add FileScanConfigBuilder Mar 24, 2025
@blaginin blaginin requested a review from mertak-synnada March 24, 2025 21:24

  // Finally, put it all together into a DataSourceExec
- Ok(file_scan_config.build())
+ Ok(Arc::new(DataSourceExec::new(Arc::new(file_scan_config))))
Contributor

Isn't it possible to have this returned from a function? Is it because of import cycles?

Contributor Author

You mean as a DataSourceExec function? It also looks a bit verbose to me, but the inner Arc is needed for dynamic dispatch, and the outer one makes the return type more explicit. Happy to make it DataSourceExec::new_arc if you want, but I don't think we use that a lot in DataFusion.
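
The double `Arc` described in this comment can be illustrated with a minimal sketch. The traits `Source` and `Plan` and the types `FileSource` and `SourceExec` are hypothetical stand-ins invented for illustration; they are not DataFusion's real traits or signatures.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a data source trait.
pub trait Source {
    fn describe(&self) -> String;
}

/// Hypothetical stand-in for an execution plan trait.
pub trait Plan {
    fn name(&self) -> String;
}

/// A concrete source; the plan node never sees this type directly.
pub struct FileSource {
    pub path: String,
}

impl Source for FileSource {
    fn describe(&self) -> String {
        format!("file:{}", self.path)
    }
}

/// Plan node that owns its source as a trait object.
pub struct SourceExec {
    source: Arc<dyn Source>,
}

impl SourceExec {
    pub fn new(source: Arc<dyn Source>) -> Self {
        Self { source }
    }
}

impl Plan for SourceExec {
    fn name(&self) -> String {
        format!("SourceExec({})", self.source.describe())
    }
}

pub fn make_plan(path: &str) -> Arc<dyn Plan> {
    let source = FileSource { path: path.to_string() };
    // Inner Arc: Arc<FileSource> coerces to Arc<dyn Source> for dynamic dispatch.
    // Outer Arc: Arc<SourceExec> coerces to the Arc<dyn Plan> return type.
    Arc::new(SourceExec::new(Arc::new(source)))
}
```

Both coercions happen implicitly at the call and return sites, which is why the nested `Arc::new` calls are needed even though they read as verbose.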

Contributor

I meant just a cosmetic change to avoid rewriting the whole Arc::new(DataSourceExec::new(Arc::new(file_scan))). Maybe it can be something like DataSourceExec::from_file_source(file_scan) -> Arc<DataSourceExec>.

What do you think?

Contributor Author

Do you like 748a335? It looks cleaner.
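
A convenience constructor of the kind discussed in this thread could look roughly like the sketch below. The name `from_data_source` is taken from this PR's commit list; the signature shown and the toy types `Source`, `FileSource`, and `SourceExec` are assumptions made for illustration, not the merged DataFusion code.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a data source trait.
pub trait Source {
    fn describe(&self) -> String;
}

pub struct FileSource {
    pub path: String,
}

impl Source for FileSource {
    fn describe(&self) -> String {
        format!("file:{}", self.path)
    }
}

/// Hypothetical plan node holding its source as a trait object.
pub struct SourceExec {
    source: Arc<dyn Source>,
}

impl SourceExec {
    pub fn new(source: Arc<dyn Source>) -> Self {
        Self { source }
    }

    /// Convenience constructor (assumed signature): performs both `Arc`
    /// wraps so call sites do not repeat them.
    pub fn from_data_source(source: impl Source + 'static) -> Arc<Self> {
        Arc::new(Self::new(Arc::new(source)))
    }

    pub fn describe(&self) -> String {
        self.source.describe()
    }
}
```

A call site then shrinks to `SourceExec::from_data_source(FileSource { path })`, with the nesting hidden in one place.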

@AdamGS
Contributor

AdamGS commented Mar 25, 2025

My 2c: this looks great. I would also love to rename FileScanConfig to something else, but I understand the backwards-compatibility concerns.

@blaginin blaginin requested a review from mertak-synnada March 26, 2025 14:25
Contributor

@alamb alamb left a comment

Thanks @blaginin -- this looks great to me. Thank you!

I think it is really nice that it is backwards compatible as well (deprecating the old APIs rather than removing them).

I have one request for some more documentation, but I think this one is really good now.

@@ -174,6 +175,219 @@ pub struct FileScanConfig {
pub batch_size: Option<usize>,
}

#[derive(Clone)]
pub struct FileScanConfigBuilder {
Contributor

Can we please add comments and examples here? Perhaps just pointing back to FileScanConfig?

Contributor Author

Added docs for the methods and an example.

@@ -326,14 +544,15 @@ impl FileScanConfig {
/// # Parameters:
/// * `object_store_url`: See [`Self::object_store_url`]
/// * `file_schema`: See [`Self::file_schema`]
#[allow(deprecated)] // `new` will be removed same time as `with_source`
Contributor

It is nice that this deprecates the old API rather than removing it, which will guide users during the upgrade.
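
The deprecate-rather-than-remove approach praised here relies on Rust's built-in `#[deprecated]` and `#[allow(deprecated)]` attributes. The sketch below uses hypothetical types (`Config`, `ConfigBuilder`) and helper functions invented for illustration; it is not the actual DataFusion code.

```rust
/// Hypothetical config type that still carries an old builder-style setter.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Config {
    limit: Option<usize>,
}

impl Config {
    /// Old setter, kept so existing callers still compile, but flagged
    /// so the compiler points them at the replacement.
    #[deprecated(note = "use ConfigBuilder::with_limit instead")]
    pub fn with_limit(mut self, limit: Option<usize>) -> Self {
        self.limit = limit;
        self
    }

    pub fn limit(&self) -> Option<usize> {
        self.limit
    }
}

/// Hypothetical replacement builder.
#[derive(Debug, Clone, Default)]
pub struct ConfigBuilder {
    limit: Option<usize>,
}

impl ConfigBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn with_limit(mut self, limit: Option<usize>) -> Self {
        self.limit = limit;
        self
    }

    pub fn build(self) -> Config {
        Config { limit: self.limit }
    }
}

/// Legacy call path: still works; the deprecation warning is silenced locally.
#[allow(deprecated)]
pub fn legacy_config() -> Config {
    Config::default().with_limit(Some(5))
}

/// New call path through the builder.
pub fn new_config() -> Config {
    ConfigBuilder::new().with_limit(Some(5)).build()
}
```

Callers that have not migrated get a compiler warning carrying the `note` text instead of a build failure, which is what makes the upgrade path gradual.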

@@ -645,6 +875,7 @@ impl FileScanConfig {

// TODO: This function should be moved into DataSourceExec once FileScanConfig moved out of datafusion/core
Contributor

I believe we can also remove this TODO with the suggestion above.

Contributor Author

Good catch

@blaginin blaginin requested a review from mertak-synnada March 27, 2025 14:50
Contributor

@mertak-synnada mertak-synnada left a comment

Thanks! Looking good to me

@alamb alamb merged commit 4a3f70c into apache:main Mar 28, 2025
27 checks passed
@alamb
Contributor

alamb commented Mar 28, 2025

Thanks again @blaginin @mertak-synnada

qstommyshu pushed a commit to qstommyshu/datafusion that referenced this pull request Mar 28, 2025
* WIP: Add `FileScanConfigBuilder` and switch some cases

* Fmt

* More fmt

* Clean `FileScanConfig::build`

* Clean `FileScanConfig::new`

* Fix csv + fmt

* More fixes

* Remove pub

* Remove todo

* Add usage example

* Fix input type

* Add `from_data_source`

* Fmt

* Add docs for `with_source`
Labels
core Core DataFusion crate datasource Changes to the datasource crate proto Related to proto crate substrait Changes to the substrait crate